Introduction
Vancouver Trees Study
Author - Blair Cheredaryk. Published on October 24th, 2023
We will be taking a look at the tree species in the city of Vancouver, Canada. The dataset is taken from the Vancouver Street Trees dataset, accessed through this link. I want to take a look at the species and how they relate to the different neighbourhoods.
The data were obtained from the City of Vancouver's Open Data Portal and follow an Open Government Licence.
import pandas as pd
import altair as alt
Let's read in the dataframe from the website given.
van_trees_df = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv', parse_dates=['date_planted'])
van_trees_df.head()
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
5 rows × 21 columns
van_trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Unnamed: 0          5000 non-null   int64
 1   std_street          5000 non-null   object
 2   on_street           5000 non-null   object
 3   species_name        5000 non-null   object
 4   neighbourhood_name  5000 non-null   object
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object
 8   genus_name          5000 non-null   object
 9   assigned            5000 non-null   object
 10  civic_number        5000 non-null   int64
 11  plant_area          4950 non-null   object
 12  curb                5000 non-null   object
 13  tree_id             5000 non-null   int64
 14  common_name         5000 non-null   object
 15  height_range_id     5000 non-null   int64
 16  on_street_block     5000 non-null   int64
 17  cultivar_name       2658 non-null   object
 18  root_barrier        5000 non-null   object
 19  latitude            5000 non-null   float64
 20  longitude           5000 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB
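The non-null counts reported by `info()` can also be checked directly for just the columns of interest. The sketch below is a minimal illustration that uses a tiny hand-made frame (`sample`, with made-up values, not the real dataset) so that it runs on its own; with the real data, `sample` would be swapped for `van_trees_df`.

```python
import pandas as pd

# Tiny stand-in for van_trees_df; the values are invented for illustration.
sample = pd.DataFrame({
    "species_name": ["RUBRUM", "SERRULATA", "NIGRA"],
    "street_side_name": ["ODD", "EVEN", "ODD"],
    "height_range_id": [4, 1, 4],
    "neighbourhood_name": ["Sunset", "Kitsilano", "Sunset"],
    "date_planted": pd.to_datetime(["2000-02-23", None, None]),
})

cols = ["species_name", "street_side_name", "height_range_id", "neighbourhood_name"]

# Count missing values per column; all zeros means every column is complete.
missing_per_column = sample[cols].isna().sum()
print(missing_per_column)

# date_planted, by contrast, has gaps (NaT counts as missing).
missing_dates = int(sample["date_planted"].isna().sum())
print(missing_dates)
```

A column with a non-zero count here would need either filtering or imputation before being charted.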
We will be using the species_name, street_side_name, height_range_id, and neighbourhood_name columns in this presentation. These are good columns to work with because none of them has missing values.
Let's describe the data to see what we find.
van_trees_df.describe()
| | Unnamed: 0 | diameter | civic_number | tree_id | height_range_id | on_street_block | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 14861.920400 | 12.340888 | 2975.707600 | 128682.584600 | 2.73440 | 2960.227000 | 49.247349 | -123.107128 |
| std | 8680.023278 | 9.266600 | 2078.580429 | 75412.260406 | 1.56957 | 2086.861052 | 0.021251 | 0.049137 |
| min | 2.000000 | 0.000000 | 2.000000 | 36.000000 | 0.00000 | 0.000000 | 49.202783 | -123.220560 |
| 25% | 7192.750000 | 4.000000 | 1300.500000 | 61321.500000 | 2.00000 | 1300.000000 | 49.230152 | -123.144178 |
| 50% | 14870.000000 | 10.000000 | 2639.000000 | 130130.500000 | 2.00000 | 2600.000000 | 49.247981 | -123.105861 |
| 75% | 22366.750000 | 18.000000 | 4123.000000 | 191332.000000 | 4.00000 | 4100.000000 | 49.263275 | -123.063484 |
| max | 29992.000000 | 71.000000 | 9113.000000 | 270750.000000 | 9.00000 | 9100.000000 | 49.293930 | -123.023311 |
A few takeaways from the describe function:
1. There are 5000 trees.
2. The diameter ranges from 0 (presumably a missing value) to 71.
3. The minimum tree height range is 0 (presumably also a missing value, which throws off the data).
4. The maximum tree height range is 9.
5. The average diameter of all the trees is 12.34.
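If the zeros really are placeholders for unrecorded measurements (an assumption, not something the dataset documentation is quoted as saying here), one way to keep them from skewing the summary is to convert them to NaN first. The sketch below uses made-up numbers in a small frame named `sample` to show the effect on the mean.

```python
import numpy as np
import pandas as pd

# Made-up measurements; 0 stands in for "not recorded" (an assumption).
sample = pd.DataFrame({
    "diameter": [0.0, 4.0, 10.0, 18.0],
    "height_range_id": [0, 2, 2, 4],
})

# Replace 0 with NaN so that mean(), min(), and describe() skip those rows.
cleaned = sample.replace(0, np.nan)

raw_mean = sample["diameter"].mean()
clean_mean = cleaned["diameter"].mean()
print(raw_mean, clean_mean)
print(cleaned.describe())
```

With the zeros excluded, the minimum and the mean both move upward, which is the distortion the takeaways above point at.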
A few questions I would like to answer are:
1. What are the four most common tree species?
2. Which neighbourhood has the most and the least of each of these species?
3. Does it matter what side of the street the tree lives on for height?
4. Which neighbourhood has the tallest and shortest tree and what side of the street is it on?
5. Which neighbourhood has the most trees overall and which has the least?
# Chart the number of trees total for each neighbourhood
chart1 = alt.Chart(van_trees_df).mark_bar().encode(
x = alt.X('count()', title='Number of Trees'),
y = alt.Y('neighbourhood_name', title='Neighbourhood', sort='x')
).properties(title='Figure 1: Number of Trees in Each Neighbourhood')
chart1
From this chart we can see that Renfrew-Collingwood has the most trees and that Strathcona has the least.
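The reading from the chart can be cross-checked numerically. This is a minimal sketch with a made-up stand-in frame (`sample`); on the real data the same two calls would be applied to `van_trees_df['neighbourhood_name']`.

```python
import pandas as pd

# Invented rows, one per tree, for illustration only.
sample = pd.DataFrame({
    "neighbourhood_name": [
        "Sunset", "Sunset", "Sunset",
        "Kitsilano", "Kitsilano",
        "Strathcona",
    ],
})

# Number of trees per neighbourhood.
counts = sample["neighbourhood_name"].value_counts()

# Labels of the largest and smallest counts.
most = counts.idxmax()
least = counts.idxmin()
print(most, least)
```

Note that `idxmax` and `idxmin` return only one label each, so ties would need a separate check.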
Now let's see which tree species are the most common in Vancouver. I will filter the data so that we only keep the species that have more than 50 occurrences.
species_counts_total = van_trees_df.groupby('species_name')['neighbourhood_name'].agg(['count'])
species_counts_filtered = species_counts_total.loc[species_counts_total['count'] > 50].reset_index()
species_counts_filtered
| | species_name | count |
|---|---|---|
| 0 | ACERIFOLIA X | 60 |
| 1 | AMERICANA | 182 |
| 2 | BETULUS | 170 |
| 3 | CALLERYANA | 78 |
| 4 | CAMPESTRE | 124 |
| 5 | CERASIFERA | 396 |
| 6 | EUCHLORA X | 152 |
| 7 | FLORIBUNDA | 66 |
| 8 | FREEMANI X | 127 |
| 9 | HIPPOCASTANUM | 69 |
| 10 | JAPONICA | 68 |
| 11 | JAPONICUM | 79 |
| 12 | KOBUS | 93 |
| 13 | PALUSTRIS | 66 |
| 14 | PENDULA | 68 |
| 15 | PENNSYLVANICA | 59 |
| 16 | PERSICA | 58 |
| 17 | PLATANOIDES | 444 |
| 18 | PSEUDOPLATANUS | 82 |
| 19 | RUBRUM | 261 |
| 20 | SERRULATA | 463 |
| 21 | STYRACIFLUA | 52 |
| 22 | SYLVATICA | 178 |
| 23 | TRIACANTHOS | 54 |
| 24 | TRUNCATUM | 70 |
| 25 | X YEDOENSIS | 90 |
| 26 | XX | 57 |
| 27 | ZUMI | 65 |
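The counts in this table can also be ranked directly instead of being read off a chart. The sketch below copies six of the counts from the table into a Series (the name `species_counts` is just for this example) and picks the four largest with `nlargest`.

```python
import pandas as pd

# Six counts copied from the table above, deliberately out of order.
species_counts = pd.Series(
    {
        "AMERICANA": 182,
        "CERASIFERA": 396,
        "PLATANOIDES": 444,
        "RUBRUM": 261,
        "SERRULATA": 463,
        "SYLVATICA": 178,
    },
    name="count",
)

# Keep the four largest counts, sorted from most to least common.
top_four = species_counts.nlargest(4)
top_four_names = top_four.index.tolist()
print(top_four_names)
```

Deriving the names this way avoids retyping them by hand if the underlying sample changes.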
Let's chart this!
# Chart the number of tree species
chart2 = alt.Chart(species_counts_filtered, width=500, height=1000).mark_bar().encode(
x = alt.X('count', title='Number of Trees', scale=alt.Scale(domain=[0, 500])),
y = alt.Y('species_name', title='Species', sort='x')
).properties(title='Figure 2: Number of Trees per Species')
chart2
From Figure 2 we see that Serrulata, Platanoides, Cerasifera, and Rubrum are the four most common tree species in Vancouver. Now let's find out which neighbourhoods have these four species. I'll filter them into a new DataFrame first.
# Filter the DataFrame to only include rows where the species_name is one of the 4 specified species
species_list = ['RUBRUM', 'CERASIFERA', 'PLATANOIDES', 'SERRULATA']
species_df = van_trees_df[van_trees_df['species_name'].isin(species_list)]
species_df.head()
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 11 | 20371 | W 45TH AV | W 45TH AV | CERASIFERA | Kerrisdale | NaT | 4.5 | ODD | PRUNUS | N | ... | 8 | Y | 205247 | NIGHT PURPLE LEAF PLUM | 2 | 2100 | NIGRA | N | 49.230925 | -123.156131 |
| 15 | 5416 | GOTHARD ST | CLARENDON ST | PLATANOIDES | Renfrew-Collingwood | 1994-11-08 | 3.0 | EVEN | ACER | Y | ... | 8 | Y | 156303 | GLOBEHEAD NORWAY MAPLE | 2 | 4700 | GLOBOSUM | N | 49.241778 | -123.054438 |
| 18 | 7526 | E 55TH AV | E 55TH AV | CERASIFERA | Victoria-Fraserview | NaT | 15.0 | ODD | PRUNUS | N | ... | 6 | Y | 48307 | PISSARD PLUM | 3 | 1600 | ATROPURPUREUM | N | 49.220019 | -123.072691 |
| 19 | 17945 | W 12TH AV | W 12TH AV | SERRULATA | Kitsilano | 2008-03-13 | 9.0 | ODD | PRUNUS | N | ... | 20 | Y | 106587 | SHIROTAE(MT FUJI) CHERRY | 1 | 2600 | SHIROTAE | N | 49.261319 | -123.164948 |
5 rows × 21 columns
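The second question (which neighbourhood has the most and the least of each species) can also be approached as a table rather than a chart. The following is a sketch under stated assumptions: `sample` is a made-up stand-in for `species_df`, and a neighbourhood with zero trees of a species is counted as having "the least".

```python
import pandas as pd

# Invented rows standing in for species_df.
sample = pd.DataFrame({
    "species_name": ["RUBRUM", "RUBRUM", "RUBRUM",
                     "SERRULATA", "SERRULATA", "SERRULATA"],
    "neighbourhood_name": ["Sunset", "Sunset", "Killarney",
                           "Kitsilano", "Kitsilano", "Sunset"],
})

# Rows are neighbourhoods, columns are species, cells are tree counts.
table = pd.crosstab(sample["neighbourhood_name"], sample["species_name"])

# For each species (column), the neighbourhood with the largest/smallest count.
most = table.idxmax()
least = table.idxmin()
print(table)
print(most)
print(least)
```

Because `crosstab` fills absent combinations with 0, `idxmin` will report a neighbourhood that has none of a species if one exists; filtering out zeros first would answer the narrower question of the smallest non-zero count.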
# Create a bar chart of the neighbourhood_name for each of the 4 specified species
chart4 = alt.Chart(species_df, width=500, height=300).mark_bar().encode(
x = alt.X('count()', title='Number of Trees', scale=alt.Scale(domain=[0, 150])),
y = alt.Y('neighbourhood_name', title='Neighbourhood', sort='x'),
color = alt.Color('species_name', title='Species')
).properties(title='Figure 3: Number of Trees by Species and Neighbourhood')
chart4
It's hard to read anything from Figure 3: with all the data stacked, it is difficult to compare the species across neighbourhoods. Let's facet the chart by species to get a clearer look at the data.
title_chart5 = alt.TitleParams('Figure 4: Number of Trees by Species and Neighbourhood', anchor='middle')
chart5 = alt.Chart(species_df, width=500, height=300).mark_bar().encode(
x = alt.X('count()', title='Number of Trees', scale=alt.Scale(domain=[0, 50])),
y = alt.Y('neighbourhood_name', title='Neighbourhood', sort='x'),
color = alt.Color('species_name', title='Species')
).facet('species_name', columns=2, title=title_chart5)
chart5
This is better, but it's still a bit hard to read because there are so many neighbourhoods. We will add a few selection tools, but first let's break this chart down by the side of the street the trees live on to see if that clears up the data.
title_chart6 = alt.TitleParams('Figure 5: Number of Trees by Species and Neighbourhood', anchor='middle')
chart6 = alt.Chart(species_df, width=200, height=200).mark_bar().encode(
x = alt.X('count()', title='Number of Trees', scale=alt.Scale(domain=[0, 50])),
y = alt.Y('neighbourhood_name', title='Neighbourhood', sort='x'),
color = alt.Color('species_name', title='Species')
).properties(title='Number of Trees by Species and Neighbourhood'
).facet(column= 'species_name', row= 'street_side_name', title=title_chart6)
chart6
This looks great, so let's add a title and subtitle, then add some functionality!
stock_title = alt.TitleParams(
"Figure 6: Top Four Tree Species by Neighbourhood in Vancouver",
subtitle = "The location is based on the Odd or Even side of the street",
anchor='middle'
)
chart7 = alt.Chart(species_df, width=200, height=200).mark_bar().encode(
x = alt.X('count()', title='Number of Trees', scale=alt.Scale(domain=[0, 50])),
y = alt.Y('neighbourhood_name', title='Neighbourhood', sort='x'),
color = alt.Color('species_name', title='Species')
).facet(column= 'species_name', row= 'street_side_name', title=stock_title
).configure_header(labelFontSize=10, titleFontSize=0)
chart7
I like the color choices and how the legend explains them, but I think we can show more about the top four species: the number of trees, their neighbourhoods, and one more statistic such as tree height. Let's bring this into a single plot with a dropdown for the neighbourhoods, then facet by species again for a clearer look at the data, this time using the height ranges as well.
stock_title = alt.TitleParams(
"Figure 7: Top Four Tree Species by Neighbourhood in Vancouver",
subtitle = "The location is based on the Odd or Even side of the street",
anchor='middle')
neighbourhood = sorted(species_df['neighbourhood_name'].unique())
dropdown_neighbourhood = alt.binding_select(name='Neighbourhood', options=neighbourhood)
select_neighbourhood = alt.selection_single(fields=['neighbourhood_name'], bind=dropdown_neighbourhood)
chart9 = alt.Chart(species_df).mark_circle(size=50).encode(
x= alt.X('count():Q', type = 'quantitative',scale=alt.Scale(domain=[0, 15]), title='Number of Trees'),
y= alt.Y('height_range_id:N', title='Tree Height Range', sort = 'descending'),
color=alt.Color('street_side_name', title="Street Side"),
opacity=alt.condition(select_neighbourhood, alt.value(0.8), alt.value(0.08)),
tooltip=['species_name:N', 'neighbourhood_name:N', 'street_side_name:N', 'height_range_id:Q']
).interactive(
).add_selection(select_neighbourhood
).transform_filter(select_neighbourhood
).properties(
title=stock_title,
width=350,
height=300,
).facet(column= 'species_name', title=stock_title
)
chart9
We can now see the data clearly, but I would like to add more interaction to each chart, such as zooming and a color change based on selection. To do this, I'll have to create a separate DataFrame for each of the four species. Then I will add a brush selection that highlights part of the figure to show all the trees of the same height in a neighbourhood. The last selection tool I'll add is the ability to click on the color legend to isolate the odd, med, or even side of the street.
# Using filter_species from the species_list module, create one DataFrame per tree species
from species_list import filter_species
Cerasifera = filter_species(van_trees_df, 'CERASIFERA')
Rubrum = filter_species(van_trees_df, 'RUBRUM')
Serrulata = filter_species(van_trees_df, 'SERRULATA')
Platanoides = filter_species(van_trees_df, 'PLATANOIDES')
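The `species_list` module itself is not shown in this notebook. As a guess at what `filter_species` might look like (a hypothetical reconstruction, not the author's actual code), a one-line boolean filter on `species_name` would be enough:

```python
import pandas as pd


def filter_species(df: pd.DataFrame, species: str) -> pd.DataFrame:
    """Hypothetical helper: keep only rows whose species_name equals `species`."""
    # .copy() avoids chained-assignment warnings if the result is modified later.
    return df[df["species_name"] == species].copy()


# Made-up rows for a quick check.
sample = pd.DataFrame({
    "species_name": ["RUBRUM", "SERRULATA", "RUBRUM"],
    "neighbourhood_name": ["Killarney", "Kitsilano", "Riley Park"],
})

rubrum = filter_species(sample, "RUBRUM")
print(rubrum)
```

Keeping the filter in a function means the four species frames are built the same way, which matters once the charts are compared side by side.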
stock_title = alt.TitleParams(
"Figure 8: Top Four Tree Species by Neighbourhood in Vancouver",
subtitle = "The location is based on the Odd or Even side of the street",
anchor='middle')
neighbourhood = sorted(species_df['neighbourhood_name'].unique())
dropdown_neighbourhood = alt.binding_select(name='Neighbourhood', options=neighbourhood)
select_neighbourhood = alt.selection_single(fields=['neighbourhood_name'], bind=dropdown_neighbourhood)
brush = alt.selection_interval()
click_legend = alt.selection_multi(fields=['street_side_name'], bind='legend')
chart11 = alt.Chart(Cerasifera).mark_circle(size=50).encode(
x= alt.X('count():Q', type = 'quantitative',scale=alt.Scale(domain=[0, 15]), title='Number of Trees'),
y= alt.Y('height_range_id:N', title='Tree Height Range', sort = 'descending'),
color=alt.condition(brush, 'street_side_name', alt.value('lightgray')),
opacity=alt.condition(click_legend, alt.value(0.8), alt.value(0.08)),
tooltip=['species_name:N', 'neighbourhood_name:N', 'street_side_name:N', 'height_range_id:Q']
).interactive(
).add_selection(select_neighbourhood, brush, click_legend
).transform_filter(select_neighbourhood
).properties(
title='Cerasifera',
width=350,
height=300,
)
chart11
# create four figures for comparison
stock_title = alt.TitleParams(
"Figure 8: Top Four Tree Species by Neighbourhood in Vancouver",
subtitle = "The location is based on the Odd or Even side of the street",
anchor='middle')
neighbourhood = sorted(species_df['neighbourhood_name'].unique())
dropdown_neighbourhood = alt.binding_select(name='Neighbourhood', options=neighbourhood)
select_neighbourhood = alt.selection_single(fields=['neighbourhood_name'], bind=dropdown_neighbourhood)
brush = alt.selection_interval()
click_legend = alt.selection_multi(fields=['street_side_name'], bind='legend')
chart11 = alt.Chart(Cerasifera).mark_circle(size=50).encode(
x= alt.X('count():Q', type = 'quantitative',scale=alt.Scale(domain=[0, 15]), title='Number of Trees'),
y= alt.Y('height_range_id:N', title='Tree Height Range', sort = 'descending'),
color=alt.condition(brush, 'street_side_name', alt.value('lightgray')),
opacity=alt.condition(click_legend, alt.value(0.8), alt.value(0.08)),
tooltip=['species_name:N', 'neighbourhood_name:N', 'street_side_name:N', 'height_range_id:Q']
).interactive(
).add_selection(select_neighbourhood, brush, click_legend
).transform_filter(select_neighbourhood
).properties(
title='Cerasifera',
width=350,
height=300,
)
chart12 = alt.Chart(Platanoides).mark_circle(size=50).encode(
x= alt.X('count():Q', type = 'quantitative',scale=alt.Scale(domain=[0, 15]), title='Number of Trees'),
y= alt.Y('height_range_id:N', title='Tree Height Range', sort = 'descending'),
color=alt.condition(brush, 'street_side_name', alt.value('lightgray')),
opacity=alt.condition(click_legend, alt.value(0.8), alt.value(0.08)),
tooltip=['species_name:N', 'neighbourhood_name:N', 'street_side_name:N', 'height_range_id:Q']
).interactive(
).add_selection(select_neighbourhood, brush, click_legend
).transform_filter(select_neighbourhood
).properties(
title='Platanoides',
width=350,
height=300,
)
chart13 = alt.Chart(Serrulata).mark_circle(size=50).encode(
x= alt.X('count():Q', type = 'quantitative',scale=alt.Scale(domain=[0, 15]), title='Number of Trees'),
y= alt.Y('height_range_id:N', title='Tree Height Range', sort = 'descending'),
color=alt.condition(brush, 'street_side_name', alt.value('lightgray')),
opacity=alt.condition(click_legend, alt.value(0.8), alt.value(0.08)),
tooltip=['species_name:N', 'neighbourhood_name:N', 'street_side_name:N', 'height_range_id:Q']
).interactive(
).add_selection(select_neighbourhood, brush, click_legend
).transform_filter(select_neighbourhood
).properties(
title='Serrulata',
width=350,
height=300,
)
chart14 = alt.Chart(Rubrum).mark_circle(size=50).encode(
x= alt.X('count():Q', type = 'quantitative',scale=alt.Scale(domain=[0, 15]), title='Number of Trees'),
y= alt.Y('height_range_id:N', title='Tree Height Range', sort = 'descending'),
color=alt.condition(brush, 'street_side_name', alt.value('lightgray')),
opacity=alt.condition(click_legend, alt.value(0.8), alt.value(0.08)),
tooltip=['species_name:N', 'neighbourhood_name:N', 'street_side_name:N', 'height_range_id:Q']
).interactive(
).add_selection(select_neighbourhood, brush, click_legend
).transform_filter(select_neighbourhood
).properties(
title='Rubrum',
width=350,
height=300,
)
((chart11 | chart12) & (chart13 | chart14)).properties(title = stock_title)
Ok, this is what I want to see! Here we can filter by neighbourhood using the dropdown menu, which leaves room to show the heights of these trees as well. We can also use this chart to look at just one specific neighbourhood.
Observations using Figure 7
Original Questions to be answered
<ol start="3">
<li><font size="2" style='color:red;'>Does it matter what side of the street the tree lives on for height?</font> Eight neighbourhoods have their tallest tree on the even side, whereas nine have it on the odd side, so street side does not appear to affect tree height. That makes sense: when you look down a street, the tree heights rarely differ from one side to the other.</li>
<li><font size="2" style='color:red;'>Which neighbourhood has the tallest tree and what side of the street is it on?</font>
<ol>
<li>Platanoides - Kerrisdale has the tallest at height range 8, and it is on the odd side of the street</li>
<li>Cerasifera - Many neighbourhoods have the tallest at height range 4, but Hastings is the only one to have a tree at 4 on both sides of the street</li>
<li>Serrulata - Riley Park has the tallest at height range 5, and it is on the even side of the street</li>
<li>Rubrum - Killarney and Riley Park have the tallest at height range 6, and it is on the odd side of the street</li>
</ol>
</li>
<li><font size="2" style='color:gray;'>Which neighbourhood has the most trees overall and which has the least?</font> Renfrew-Collingwood has the most trees and Strathcona has the least.</li>
</ol>
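The even-versus-odd tally used for the street-side question can be reproduced with a groupby instead of reading each neighbourhood off the chart. This is a minimal sketch on made-up data (`sample`); on the real data it would be run on the species frames, and ties between sides would need separate handling because `idxmax` keeps only the first side it finds.

```python
import pandas as pd

# Invented rows, one per tree.
sample = pd.DataFrame({
    "neighbourhood_name": ["Sunset", "Sunset", "Sunset", "Kitsilano", "Kitsilano"],
    "street_side_name": ["ODD", "EVEN", "EVEN", "ODD", "EVEN"],
    "height_range_id": [4, 2, 3, 1, 5],
})

# Tallest height range per neighbourhood and street side, one column per side.
tallest = (
    sample.groupby(["neighbourhood_name", "street_side_name"])["height_range_id"]
    .max()
    .unstack()
)

# For each neighbourhood, the side that holds its tallest tree.
side_with_tallest = tallest.idxmax(axis=1)

# How many neighbourhoods have their tallest tree on each side.
tally = side_with_tallest.value_counts()
print(side_with_tallest)
print(tally)
```

A roughly even tally, like the 8 versus 9 split reported above, is what one would expect if street side has no bearing on height.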